{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 149,
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "import sys\n",
    "import os\n",
    "if not any(path.endswith('textbook') for path in sys.path):\n",
    "    sys.path.append(os.path.abspath('../../..'))\n",
    "from textbook_utils import *"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(ch:files_granularity)=\n",
    "# Table Shape and Granularity\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As described earlier, we refer to a dataset's *structure* as a mental representation of the data, and in particular, we represent data that have a *table* structure by arranging values in rows and columns. We use the term *granularity* to describe what each row in the table represents, and the term *shape* quantifies the table's rows and columns.\n",
    "\n",
    "Now that we have determined the format of the restaurant-related files, we load them into dataframes\n",
    "and examine their shapes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 150,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "bus = pd.read_csv('data/businesses.csv', encoding='ISO-8859-1')\n",
    "insp = pd.read_csv(\"data/inspections.csv\")\n",
    "viol = pd.read_csv(\"data/violations.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 151,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " Businesses: (6406, 9) \t Inspections: (14222, 4) \t Violations: (39042, 3)\n"
     ]
    }
   ],
   "source": [
    "print(\" Businesses:\", bus.shape, \"\\t Inspections:\", insp.shape, \n",
    "     \"\\t Violations:\", viol.shape)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We find that the table with the restaurant information (the business table) has 6,406 rows and 9 columns. Now let's figure out the granularity of this table. To start, we can look at the first two rows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 152,
   "metadata": {
    "tags": [
     "remove-input"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>business_id</th>\n",
       "      <th>name</th>\n",
       "      <th>address</th>\n",
       "      <th>city</th>\n",
       "      <th>...</th>\n",
       "      <th>postal_code</th>\n",
       "      <th>latitude</th>\n",
       "      <th>longitude</th>\n",
       "      <th>phone_number</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>19</td>\n",
       "      <td>NRGIZE LIFESTYLE CAFE</td>\n",
       "      <td>1200 VAN NESS AVE, 3RD FLOOR</td>\n",
       "      <td>San Francisco</td>\n",
       "      <td>...</td>\n",
       "      <td>94109</td>\n",
       "      <td>37.79</td>\n",
       "      <td>-122.42</td>\n",
       "      <td>+14157763262</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>24</td>\n",
       "      <td>OMNI S.F. HOTEL - 2ND FLOOR PANTRY</td>\n",
       "      <td>500 CALIFORNIA ST, 2ND  FLOOR</td>\n",
       "      <td>San Francisco</td>\n",
       "      <td>...</td>\n",
       "      <td>94104</td>\n",
       "      <td>37.79</td>\n",
       "      <td>-122.40</td>\n",
       "      <td>+14156779494</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>2 rows × 9 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   business_id                                name   \n",
       "0           19               NRGIZE LIFESTYLE CAFE  \\\n",
       "1           24  OMNI S.F. HOTEL - 2ND FLOOR PANTRY   \n",
       "\n",
       "                         address           city  ... postal_code latitude   \n",
       "0   1200 VAN NESS AVE, 3RD FLOOR  San Francisco  ...       94109    37.79  \\\n",
       "1  500 CALIFORNIA ST, 2ND  FLOOR  San Francisco  ...       94104    37.79   \n",
       "\n",
       "   longitude  phone_number  \n",
       "0    -122.42  +14157763262  \n",
       "1    -122.40  +14156779494  \n",
       "\n",
       "[2 rows x 9 columns]"
      ]
     },
     "execution_count": 152,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bus.head(2)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These two rows give us the impression that each record represents a particular restaurant. But, we can't tell from just two records whether or not this is the case. The field named `business_id` implies that it is the unique identifier for the restaurant. We can confirm this by checking whether the number of records in the dataframe matches the number of unique values in the field `business_id`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 153,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of records: 6406\n",
      "Number of unique business ids: 6406\n"
     ]
    }
   ],
   "source": [
    "print(\"Number of records:\", len(bus))\n",
    "print(\"Number of unique business ids:\", len(bus['business_id'].unique()))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The number of unique `business_id`s matches the number of rows in the table, so it seems safe to assume that each row represents a restaurant. Since `business_id` uniquely identifies each record in the dataframe, we treat `business_id` as the *primary key* for the table. We can use primary keys to join tables (see {numref}`Chapter %s <ch:pandas>`).\n",
    "Sometimes a primary key consists of two (or more) features.\n",
    "This is the case for the other two restaurant files. \n",
    "Let's continue the examination of the inspections and violations dataframes\n",
    "and find their granularity."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Granularity of Restaurant Inspections and Violations"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We just saw that there are many more rows in the inspection table compared to the business table. Let's take a closer look at the first few inspections:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 154,
   "metadata": {
    "tags": [
     "remove-input"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>business_id</th>\n",
       "      <th>score</th>\n",
       "      <th>date</th>\n",
       "      <th>type</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>19</td>\n",
       "      <td>94</td>\n",
       "      <td>20160513</td>\n",
       "      <td>routine</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>19</td>\n",
       "      <td>94</td>\n",
       "      <td>20171211</td>\n",
       "      <td>routine</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>24</td>\n",
       "      <td>98</td>\n",
       "      <td>20171101</td>\n",
       "      <td>routine</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>24</td>\n",
       "      <td>98</td>\n",
       "      <td>20161005</td>\n",
       "      <td>routine</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   business_id  score      date     type\n",
       "0           19     94  20160513  routine\n",
       "1           19     94  20171211  routine\n",
       "2           24     98  20171101  routine\n",
       "3           24     98  20161005  routine"
      ]
     },
     "execution_count": 154,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "insp.head(4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 155,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "business_id  date    \n",
       "64859        20150924    2\n",
       "87440        20160801    2\n",
       "77427        20170706    2\n",
       "19           20160513    1\n",
       "71416        20171213    1\n",
       "dtype: int64"
      ]
     },
     "execution_count": 155,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(insp\n",
    " .groupby(['business_id', 'date'])\n",
    " .size()\n",
    " .sort_values(ascending=False)\n",
    " .head(5)\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The combination of restaurant ID and inspection date uniquely identifies each record in this table, with the exception of three restaurants that have two records for their ID-date combination. Let's examine the rows for restaurant `64859`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 156,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>business_id</th>\n",
       "      <th>score</th>\n",
       "      <th>date</th>\n",
       "      <th>type</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>7742</th>\n",
       "      <td>64859</td>\n",
       "      <td>96</td>\n",
       "      <td>20150924</td>\n",
       "      <td>routine</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7744</th>\n",
       "      <td>64859</td>\n",
       "      <td>91</td>\n",
       "      <td>20150924</td>\n",
       "      <td>routine</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      business_id  score      date     type\n",
       "7742        64859     96  20150924  routine\n",
       "7744        64859     91  20150924  routine"
      ]
     },
     "execution_count": 156,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "insp.query('business_id == 64859 and date == 20150924')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This restaurant got two different inspection scores on the same day! How could this happen? It may be that the restaurant had two inspections in one day, or it might be an error.\n",
    "We address these sorts of questions when we consider the data quality in {numref}`Chapter %s <ch:wrangling>`.\n",
    "Since there are only three of these double-inspection days, we can ignore the issue until we clean the data. So the primary key would be the combination of restaurant ID and inspection date if same-day inspections are removed from the table.\n",
    "\n",
    "Note that the `business_id` field in the inspections table acts as a reference to the primary key in the business table.\n",
    "So `business_id` in `insp` is a *foreign key* because it\n",
    "links each record in the inspections table to a record in the business table.\n",
    "This means that we can readily join these two tables together.  "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we examine the granularity of the third table, the one that contains the violations:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 157,
   "metadata": {
    "tags": [
     "remove-input"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>business_id</th>\n",
       "      <th>date</th>\n",
       "      <th>description</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>19</td>\n",
       "      <td>20171211</td>\n",
       "      <td>Inadequate food safety knowledge or lack of ce...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>19</td>\n",
       "      <td>20171211</td>\n",
       "      <td>Unapproved or unmaintained equipment or utensils</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>19</td>\n",
       "      <td>20160513</td>\n",
       "      <td>Unapproved or unmaintained equipment or utensi...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>39039</th>\n",
       "      <td>94231</td>\n",
       "      <td>20171214</td>\n",
       "      <td>High risk vermin infestation  [ date violation...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>39040</th>\n",
       "      <td>94231</td>\n",
       "      <td>20171214</td>\n",
       "      <td>Moderate risk food holding temperature   [ dat...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>39041</th>\n",
       "      <td>94231</td>\n",
       "      <td>20171214</td>\n",
       "      <td>Wiping cloths not clean or properly stored or ...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>39042 rows × 3 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       business_id      date  \\\n",
       "0               19  20171211   \n",
       "1               19  20171211   \n",
       "2               19  20160513   \n",
       "...            ...       ...   \n",
       "39039        94231  20171214   \n",
       "39040        94231  20171214   \n",
       "39041        94231  20171214   \n",
       "\n",
       "                                             description  \n",
       "0      Inadequate food safety knowledge or lack of ce...  \n",
       "1       Unapproved or unmaintained equipment or utensils  \n",
       "2      Unapproved or unmaintained equipment or utensi...  \n",
       "...                                                  ...  \n",
       "39039  High risk vermin infestation  [ date violation...  \n",
       "39040  Moderate risk food holding temperature   [ dat...  \n",
       "39041  Wiping cloths not clean or properly stored or ...  \n",
       "\n",
       "[39042 rows x 3 columns]"
      ]
     },
     "execution_count": 157,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "viol"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Looking at the first few records in this table, we see that each inspection has\n",
    "multiple entries. The granularity appears to be at the level of a violation found in an inspection. Reading the descriptions, we see that if corrected, a date is listed\n",
    "in the description within square brackets:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 158,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'High risk vermin infestation  [ date violation corrected: 12/15/2017 ]'"
      ]
     },
     "execution_count": 158,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "viol.loc[39039, 'description']"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In brief, we have found that the three food safety tables have different granularities. \n",
    "Since we have identified primary and foreign keys for them, we can potentially join these tables.\n",
    "If we are interested in studying inspections, we can join the violations\n",
    "and inspections together using the business ID and inspection date.\n",
    "This would let us connect the number of violations found during an inspection to the inspection score.\n",
    "\n",
    "We can also reduce the inspection table to one per restaurant by selecting the most recent inspection for each restaurant. This reduced data table essentially has a granularity of restaurant and may be useful for a restaurant-based analysis. In {numref}`Chapter %s <ch:wrangling>`, we cover these kinds of actions that reshape a data table, transform columns, and create new columns.\n",
    "\n",
    "We conclude this section with a look at the shape and granularity of the DAWN survey data."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## DAWN Survey Shape and Granularity"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As noted earlier in this chapter, the DAWN file has fixed-width formatting, and we need to rely on a codebook to find out where the fields are. As an example, a snippet of the codebook in {numref}`Figure %s <DAWN_Age>` tells us that age appears in positions 34 and 35 in a row, and it is categorized\n",
    "into 11 age groups: 1 stands for age 5 and under, 2 for ages 6 to 11, ..., and 11 for ages 65\n",
    "and older. Also,–8 represents a missing value."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```{figure} figures/DAWN_Age.png\n",
    "---\n",
    "name: DAWN_Age\n",
    "---\n",
    "\n",
    "Screenshot of a portion of the DAWN coding for age\n",
    "```\n",
    "\n",
    " "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Earlier, we determined that the file contains 200,000 lines and over 280 million\n",
    "characters, so on average, there are about 1,200 characters per line.\n",
    "This might be why they used a fixed-width rather than a CSV format.\n",
    "Think how much larger the file would be if there was a comma between every field!\n",
    "\n",
    "Given the tremendous amount of information on each line, let's read just a few\n",
    "features into a dataframe. We can use the `pandas.read_fwf` method to do this. We specify the exact positions of the fields to extract, and we provide names for these fields and other information about the header and index: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 159,
   "metadata": {},
   "outputs": [],
   "source": [
    "colspecs = [(0,6), (14,29), (33,35), (35, 37), (37, 39), (1213, 1214)]\n",
    "varNames = [\"id\", \"wt\", \"age\", \"sex\", \"race\",\"type\"]\n",
    "dawn = pd.read_fwf('data/DAWN-Data.txt', colspecs=colspecs, \n",
    "                   header=None, index_col=0, names=varNames)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 160,
   "metadata": {
    "tags": [
     "remove-input"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>wt</th>\n",
       "      <th>age</th>\n",
       "      <th>sex</th>\n",
       "      <th>race</th>\n",
       "      <th>type</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.94</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>5.99</td>\n",
       "      <td>11</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4.72</td>\n",
       "      <td>11</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4.08</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>5.18</td>\n",
       "      <td>6</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>8</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      wt  age  sex  race  type\n",
       "id                            \n",
       "1   0.94    4    1     2     8\n",
       "2   5.99   11    1     3     4\n",
       "3   4.72   11    2     2     4\n",
       "4   4.08    2    1     3     4\n",
       "5   5.18    6    1     3     8"
      ]
     },
     "execution_count": 160,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dawn.head()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can compare the rows in the table to the number of lines in the file: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 161,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(229211, 5)"
      ]
     },
     "execution_count": 161,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dawn.shape"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The number of rows in the dataframe matches the number of lines in the file. That's good.\n",
    "The granularity of the dataframe is a bit complicated due to the survey design.  Recall that these data are part of a large scientific study, with a complex sampling scheme. A row represents an emergency room visit, so the granularity is at the emergency room visit level. However, in order to reflect the sampling scheme and be representative of the population of all drug-related ER visits in a year, weights are provided. We must apply the weight to each record when we compute summary statistics, build histograms, and fit models. (The `wt` field contains these values.)\n",
    "\n",
    "The weights take into account the chance of an ER visit like this one appearing in the sample. By \"like this one\" we mean a visit with similar features, such as the visitor age, race, visit location, and time of day. Let's examine the different values in `wt`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 162,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "wt\n",
       "0.94     1719\n",
       "84.26    1617\n",
       "1.72     1435\n",
       "         ... \n",
       "1.51        1\n",
       "3.31        1\n",
       "3.33        1\n",
       "Name: count, Length: 3500, dtype: int64"
      ]
     },
     "execution_count": 162,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dawn['wt'].value_counts()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    ":::{note}\n",
    "\n",
    "What do these weights mean? As a simplified example,\n",
    "suppose you ran a survey and 45% of your respondents were under 18 years of age, but according to the US Census Bureau, only 22% of the population is under 18. You can adjust your survey responses to reflect the US population by using a small weight (22/45) for those under 18 and a\n",
    "larger weight (78/55) for those 18 and older. To see how we might use these weights, suppose the respondents are asked whether they use Facebook:\n",
    "\n",
    "| Facebook  | < 18       | 18+           | Total \n",
    "| :------------- | -------------: | -------------: |--------: \n",
    "| No    | 1   | 19  | 21    \n",
    "| Yes   | 44  | 35  | 79\n",
    "| Total | 45  | 55  | 100   \n",
    "\n",
    "Overall, 79% of the respondents say they are Facebook users, but the sample is skewed toward the younger generation. We can adjust the estimate with the weights so that the age groups match the population. Then the adjusted percentage of Facebook users drops to:\n",
    "\n",
    "$$(22/45) \\times 44 + (78/55) \\times 35 = 71 $$\n",
    "\n",
    "The DAWN survey uses the same idea, except that it splits the groups\n",
    "much more finely.\n",
    "\n",
    ":::"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It is critical to include the survey weights in your analysis to get data that represents the population at large. For example, we can compare the calculation of the proportion of females among the ER visits both with and without the weights:  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 168,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Unweighted percent female: 48.0%\n",
      "  Weighted percent female: 52.3%\n"
     ]
    }
   ],
   "source": [
    "print(f'Unweighted percent female: {np.average(dawn[\"sex\"] == 2):.1%}')\n",
    "print(f'  Weighted percent female:',\n",
    "      f'{np.average(dawn[\"sex\"] == 2, weights=dawn[\"wt\"]):.1%}')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These figures differ by more than 4 percentage points. The weighted version is a more accurate estimate of the proportion of females among the entire population of drug-related ER visits. "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Sometimes the granularity can be tricky to figure out, like we saw with the\n",
    "inspections data.\n",
    "And at other times, we need to take sampling weights into account, like for the DAWN data.\n",
    "These examples show it's important to take your time and review the data\n",
    "descriptions before proceeding with analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "celltoolbar": "Tags",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
